Regular Expressions
Regular expressions are a powerful way to define patterns for searching and matching. Beyond Compare allows you to use regular expressions when searching through text and when specifying rules for classifying text. The regular expression support in Beyond Compare is a subset of the Perl Compatible Regular Expression (PCRE) syntax.

While regular expressions can be a complex topic, there are several excellent resources about them. One is the book Mastering Regular Expressions. Another is Steve Mansour's A Tao of Regular Expressions, a copy of which can be found at: jmason.org/software/sitescooper/tao_regexps.html

A regular expression is composed of two types of characters: normal characters and metacharacters. When performing a match, metacharacters take on special meanings, controlling how the match is made and serving as wildcards. Normal characters always match only themselves. To match a metacharacter literally, escape it by prefixing it with a backslash "\". There are multiple types of metacharacters, each detailed below.

Metacharacters - Escape Sequences
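As a small illustration of escaping a metacharacter with a backslash, the sketch below uses Python's re module as a stand-in for Beyond Compare's matcher. That substitution is an assumption: Python's syntax is close to PCRE for these basics but is not the same engine, and the sample text is made up for the example.

```python
import re

line = "Version 3.5 (build 35)"  # hypothetical sample text

# Unescaped, "." acts as a wildcard, so the pattern also matches "3x5".
assert re.search(r"3.5", "3x5")

# Escaped, "\." matches only a literal period.
assert re.search(r"3\.5", "3x5") is None
found = re.search(r"3\.5", line).group()

# Parentheses are metacharacters too and are escaped the same way.
build = re.search(r"\(build 35\)", line).group()

print(found)
print(build)
```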
Metacharacters - Predefined classes

Predefined character classes match any of a certain subset of characters. The following classes are already defined for you.
You can also construct your own character classes by surrounding a group of characters in brackets "[]". The predefined classes (except ".") can be used in the brackets, and if a dash "-" appears between two characters, it represents a range. Thus [a-z] represents all lowercase letters, and [a-zA-Z] represents both lowercase and uppercase letters. To include a "-" as part of the class, place it at the beginning or end of the bracketed list. If the first character within the brackets is a caret "^", the class represents everything except the specified characters: [^a-z] matches any character that isn't a lowercase letter.

Metacharacters - Alternatives

By placing a "|" between two groups of items, alternative matches can be represented. a|b will match either "a" or "b". ab|cd will match "ab" or "cd", but not "ac". "|" groups everything from the preceding pattern delimiter ("(", "[", or the start of the pattern) up to itself as one alternative, and everything from itself to the end of the pattern as the other. Alternatives can be placed within parentheses "()" to make it obvious what is being matched, as in a(bc|de)f. Alternatives are tried left to right: bey|beyond will match "bey", even if the string is "beyond".

Metacharacters - Position

The following metacharacters control where the match can occur on a line.

Note: \A and \Z match the start and end of text respectively, but since Beyond Compare performs the search on a line-by-line basis, these have the same effect as ^ and $.
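The class, alternative, and position rules described above can be checked with a short script. This is a minimal sketch that uses Python's re module as a stand-in, on the assumption that it treats these basic constructs the same way as the PCRE subset in Beyond Compare; it does not call Beyond Compare itself, and the sample strings are invented for the example.

```python
import re

# Custom classes: ranges, a negated class, and a literal "-" placed last.
assert re.fullmatch(r"[a-z]+", "beyond")
assert re.fullmatch(r"[a-z]+", "Beyond") is None   # "B" is not in a-z
assert re.fullmatch(r"[a-zA-Z]+", "Beyond")
assert re.fullmatch(r"[^a-z]+", "ABC 123")         # no lowercase letters
assert re.fullmatch(r"[a-z-]+", "line-by-line")    # trailing "-" is literal

# Alternatives: ab|cd matches "ab" or "cd" as a whole, never "ac".
assert re.fullmatch(r"ab|cd", "ab")
assert re.fullmatch(r"ab|cd", "cd")
assert re.fullmatch(r"ab|cd", "ac") is None

# Parentheses limit the alternatives to one part of the pattern.
assert re.fullmatch(r"a(bc|de)f", "abcf")
assert re.fullmatch(r"a(bc|de)f", "adef")

# Alternatives are tried left to right, so the first one wins here.
first = re.search(r"bey|beyond", "beyond").group()
print(first)  # prints "bey"

# Position: each line is searched separately, mirroring the note above.
lines = "beyond compare\ncompare beyond".splitlines()
starts = [bool(re.search(r"^beyond", ln)) for ln in lines]
ends = [bool(re.search(r"beyond$", ln)) for ln in lines]
print(starts, ends)
```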
Metacharacters - Iterators

Anything in a regular expression can be followed by an iterator metacharacter, which applies to the item before it. There are two kinds of iterators: greedy and non-greedy. Greedy iterators match as many occurrences as they can; non-greedy iterators match as few as they can.

Greedy:
Non-greedy:
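To illustrate the difference between the two kinds of iterators, the sketch below runs a greedy and a non-greedy pattern over the same text. It uses Python's re module as a stand-in for Beyond Compare's matcher, and the non-greedy form `*?` is assumed here from common PCRE syntax rather than taken from this page.

```python
import re

text = 'say "yes" or "no"'  # hypothetical sample text

# Greedy: ".*" takes as many characters as it can, so the match runs
# from the first quote to the last quote on the line.
greedy = re.search(r'".*"', text).group()

# Non-greedy: ".*?" takes as few characters as it can, so the match
# stops at the first closing quote.
lazy = re.search(r'".*?"', text).group()

print(greedy)
print(lazy)
```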
Metacharacters - Subexpressions

Parentheses "()" can also be used to group characters for use with iterators and back references (discussed below). (bey){4,5} will match 4 or 5 consecutive instances of "bey". (abc|[0-9])* will match any combination of "abc" and the digits 0 to 9, for example "abc5", "679abc" and "abc77abc".

Metacharacters - Back References

Each sequence of characters matched within a "()" is saved as a subexpression. You can refer to these later with \1 to \9, which number the subexpressions from left to right. b(.)\1n will match "been" and "boon", but not "bean", "ben" or "beeen".

Modifiers

Modifiers change the matching behavior from that point on. If the modifier is contained within a subexpression, it affects only that subexpression. Use (?i) and (?-i) to control the case sensitivity of matching.

Examples:
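The subexpression, back reference, and modifier behavior described above can be tried with a short script. This is a sketch using Python's re module as a stand-in, not Beyond Compare itself. One known difference: Python accepts (?i) only at the start of a pattern and has no bare (?-i), so only the leading (?i) form is shown. The helper name `matches` is invented for the example.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Subexpressions combined with iterators.
assert matches(r"(bey){4,5}", "bey" * 4)
assert not matches(r"(bey){4,5}", "bey" * 3)
for sample in ("abc5", "679abc", "abc77abc"):
    assert matches(r"(abc|[0-9])*", sample)

# Back reference: \1 must repeat exactly what (.) matched.
words = ("been", "boon", "bean", "ben", "beeen")
backref = {w: matches(r"b(.)\1n", w) for w in words}
print(backref)

# Modifier: a leading (?i) turns on case-insensitive matching.
assert matches(r"(?i)beyond", "BEYOND")
assert not matches(r"beyond", "BEYOND")
```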